\documentclass[10pt,a4paper]{article}
\usepackage[latin1]{inputenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{syntax} % Grammar writer
\usepackage{listings} % source code writer
\usepackage[usenames,dvipsnames]{xcolor}

% ----------------------------------------------------------
% Source code syntax highlight
\lstdefinelanguage{Emma}
{morekeywords={print,if,else,while,def,return,for,continue,break,class,null},
morekeywords={import,package,try,raise,catch,finally,elif,read,self},
sensitive=false,
morecomment=[l]{\#},
morestring=[b]",
morestring=[b]',
} 

\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}

\lstset{frame=tb,
  language=Emma,
  emph={and,xor,or,not},
  emphstyle=\color{dkgreen},
  aboveskip=3mm,
  belowskip=3mm,
  showstringspaces=false,
  columns=fixed,
  basicstyle={\small\ttfamily},
  numbers=none,
  numberstyle=\tiny\color{gray},
  keywordstyle=\color{Sepia},
  commentstyle=\color{blue},
  stringstyle=\color{mauve},
  breaklines=true,
  breakatwhitespace=true,
  tabsize=3
}
% End of syntax highlight
% ----------------------------------------------------------

\author{Yang Wang}
\date{}
\title{The Emma Language Reference}
\begin{document}
\maketitle
\tableofcontents

\section{Introduction}
This reference manual describes the Emma language and also serves as a  
language specification and development guidance.

The Emma language is a self-education and experimental project. 
The main purpose is for the author to study the theory and best practice related
to compiler design and implementation. 
The final product is envisioned to be a high level programming language similar
to Python in spirit, but also incorporate features from other languages such as
IDL (Interactive Data Language), C, Java. The style of this reference is also
influenced by the Python reference.

The basic target of the language is to be Turning complete. The ultimate
target is to be self-hosting, i.e.\ be able to compile itself.
The current development scope of the language is listed as follows:

\begin{itemize}
\item Functions (with recursive calls)
\item Array support
\item Basic I/O support
\item Object oriented programming support by class definitions
\item Interaction among source files with import
\item Error handling and interactive debugging similar to IDL
\item Short circuit logic
\item Garbage collection
\item Interface to use external C libraries
\end{itemize}


\subsection{Implementation}
The language will be prototyped in Python and implemented in C. 
An interactive environment will be provided as well as a batch
run mode. Any programs written in the language will be compiled
to bytecode for execution speed boost. This means
that a runtime environment of the language is always required to run
programs, since there will be no compilation to native machine code.

\subsection{Terminology}
\begin{description}
\item[System] The Emma language runtime environment and its standard library.
\item[Entity] A single program unit such as variable, function, class.
\end{description}

\subsection{Philosophy}
These are the principles for the design, implementation and usage of the language.
\begin{enumerate}
\item No redundant functionalities
\item Be straightforward
\end{enumerate}

\pagebreak

\section{Language Model}

\subsection{File organization}

\subsubsection{Module and package}
A file is a module while a directory containing modules is a package. 
The module name is the file name without the extension.
The package name is just the name of a directory.
Modules and packages have structure hierarchy just like
files and directories. 
We shall use file and module interchangeably in the rest of the reference.

It is recommended to use ``.em'' as the file extension for source file names.
Other extensions can be used. However, \lstinline$import$ and 
\lstinline$package$ assume the ``.em'' extension and hence 
do not work properly for other extensions.
In addition, file and directory names should follow the same naming convention
as variables for \lstinline$import$ and \lstinline$package$ to work.

\subsubsection{The import keyword}
The interoperability among different modules are supported by the 
\lstinline$import$ and \lstinline$package$ mechanisms. 
A module can gain access other module's functionalities by \lstinline$import$
the other module. Following pseudo-code describes the general usage of the
keyword.

\begin{verbatim}
import [ <pkg> . ] <module> [ . <name> ]
\end{verbatim}

Note square brackets indicate that the item in between them is optional.
\verb$<pkg>$ is the package (directory) path of the module (file).
It is allowed to have arbitrary long package path.
\verb$<module>$ is the module name, i.e. the source file name without
the file extension.
\verb$<name>$ is a name, which is either a variable, function or class,
defined in the module file. The asterisk symbol can be used as a wildcard
name to indicate all names in the module are to be imported. 

An \lstinline$import$ statement makes names in a module visible to user code.
When a name is given in the \lstinline$import$ statement, 
it can be referenced without qualifying with its module. 
Otherwise, the full qualified name, e.g. \verb$module.name$ has to been used.
Note the package path is not needed while referencing a name regardless
of the name presence in the \lstinline$import$ statement.

Following code snippets show a few examples how import works.

\begin{lstlisting}[caption=import statement example]
import foo.bar
import foo.bar.baz
bar.qux()
baz()
\end{lstlisting}

In above code fragment, \lstinline$foo$ is a package and 
\lstinline$bar$ is a module. 
The first import has the format of \verb$pkg.module$ and does not have
a name. 
This is why the usage of \lstinline$qux$ function in \lstinline$bar$ 
has to be qualified as \lstinline$bar.qux()$.

The second import has the format of \verb$pkg.module.name$. 
Hence the name \lstinline$baz$ can be used without module qualification.

\begin{lstlisting}[caption=import statement with the wildcard character]
import foo.bar.*
qux()
baz()
\end{lstlisting}

The above import shows the usage of the wildcard character. 
After the import, all names in module bar can be directly used by
user code without qualification.

Circular import cannot be resolved. 
It will be detected when during run time and error is to be reported
to help manual fix.

\subsubsection{The package keyword}
There is one missing piece in the import mechanism described in previous
section.
That is how file search works to locate the requested module.
The \lstinline$package$ keyword can be used to affect the search behavior.

The system uses an internal variable to keep a list of paths for file
searching. By default, the paths include the builtin library directory and
the current working directory. More paths can be added to the search
paths interactively or via configuration file. The current working directory
is always the first entry in the list of paths while the builtin library 
is the last in the list.

A module is searched in the list of paths beginning from the current
working directory. The first match is used and other matches (if any)
are ignored.

A \lstinline$package$ keyword indicates the file is part of
a package, which is some directory one or more level up in the file system
hierarchy. 
Any subsequent import related to this package is to be searched
only in the directory. This is how the \lstinline$package$ keyword
changes the behavior of module searching. 
Following is an example showing code fragment of a file named 
\lstinline$mymodule.em$.

\begin{lstlisting}[caption=Example of the package keyword]
package foo
import foo.bar
import foo.bar.qux
import baz
\end{lstlisting}

The first line declares the file is part of a package \lstinline$foo$.
The package must be a directory one or more level up to the file.
Assume the file's full path name is something like 
\verb$...foo/sub/mymodule.em$. The package is obviously the 
\verb$foo$ directory in the full path name. 
The following two import are related to the package. 
The search will then be conducted in the directory \verb$foo$.
The last import is not related to the package. 
Hence the normal search process will take place by starting with
the current working directory.

The effect of the \lstinline$package$ keyword is, as its name indicates,
useful for package developers.
When a module in the package imports another module in the same
package, it can simply declare the package and start import such as
\lstinline$import $\verb$package.module$ and it will work no matter where
the package itself is located.

\subsection{Data model}
\subsubsection{Types}
In principle, there is only one data type, i.e.\ object.
Everything is of the object type, including function and class.
Therefore it is really the class of an object that we
are interested. For convenience, we will use the term ``data type"
when talking about the class. But be aware that it is really 
the class that we are talking about.

The basic data types include integer, float, string, and null. 
Two aggregate data types, list and hash, are also provided via builtin
functions. Both of them are heterogeneous, that is they can have
different data types, including themselves, as their elements. 
Note that the key of a hash can only be one of the basic data types.
There is no dedicate boolean type. 
An empty list, empty hash, empty string, number 0 and a null are all 
evaluated to zero on condition test. 
Anything else by default is evaluated to one.

Function and class can only be defined in module level code.
This means no nested function or class definitions are allowed.
Nested definitions will be identified by the Lexer and error is 
to be reported (detailed later in Lexer section).

\subsubsection{Class}

\paragraph{Basic class}
All classes are derived from a basic class, named Object. 
This class provides default methods that enable many useful behaviors, 
such as comparison, string conversion, boolean value, etc.
The default methods can all be overridden by subclasses.
A full list of default methods are shown as follows:
\begin{description}
\item[__init__] 
\item[__compare__] compare to another object
\item[__str__] return a string representation of the object
\item[__bool__] return 0 or 1 as the boolean value of the object
\end{description}

\paragraph{User defined class}
User defined class is created by subclassing existing class. 
Note that the basic Object class is used as the superclass when none is 
specified. 
Multi-inheritance is not allowed.
However, classes can have arbitrary long inheritance chains.

Class works like a namespace and variables can be added to it without
declaration. 
The object of a class is a different namespace, which saves the object's
own copy of variables.
The mechanism of class and objects are similar to those of Python.
But there is an important difference, that is \lstinline$self$ is a 
reserved word.
In a class level, variables not qualified by \lstinline$self$ are treated
as static variables. In addition, methods do not use \lstinline$self$
as its first argument is also treated as static method.
Static variables and methods must be qualified by their class when used
outside of the direct class scope, including inside method scopes or 
completely outside of the class scope.

\subsection{Name, binding and environment}
We already knew everything is an object in the Emma language. 
Names are associated with objects so that they can be uniquely identified.
A name is simply an identifier in the source code. 
The association created between a name and an object is called binding.
When a variable is created, it is essentially done by
creating a binding between the variable name and its content.

Information about bindings are kept in environment, which works
like an hash indexed by names.
For an example, a module has an environment that keeps all bindings of 
the module.

\subsubsection{Environment and scope}
New environments are created for every module, function and class.
These environments are nested and the top level environment is the one
associated with the top level module, i.e.\ the module being directly
invoked by users. 

The environment also defines the scope and how names are searched during
running time. 
The top level environment defines the global scope. 
At a time, there is only one environment, called the current scope,
at the bottom level of the environment chain.

The names are first searched in the current scope.
If it is not found, it is then searched in the scope that is one
level higher than the current scope. 
The process continues until the name is found or errors out if no match.
New variables will be created in the current scope unless explicitly 
requested to be created in other scopes.

\subsubsection{Internals of object and its binding}
Under the hood, there are only two categories of objects that names
are bound with. 
One is closure and the other is simple object.

A closure is a combination of code object and the environment where
the code object is defined.
A code object is an internal representation of source code of either a 
module, a function or a class.
Note that the code object has the ability to contain other names, which
again may associate with either closure or simple object.

Simple objects represent integer, float, string variables and any literals.
Basically anything that cannot contain other names.

\subsection{Error handling}
Error handling is done with four keywords, \lstinline$try$, \lstinline$raise$
\lstinline$catch$, and \lstinline$finally$.
In general, a risky code is ran by placing it in the \lstinline$try$ block.
At least one \lstinline$catch$ block follows the \lstinline$try$ block 
to deal with possible exceptions. A \lstinline$finally$ block at the end
to do some housekeepings.

If an error is not catched by any of the \lstinline$catch$ blocks.
The program execution is stopped with error report and full call
stack information to help debug.


\pagebreak

\section{Lexical analysis}
\subsection{Line structure}
An Emma program is consisted of lines of texts. The line structure is
hence one of the basic component of a source program.

\subsubsection{Physical line}
A physical line is terminated by an end-of-line sequence (EOL). The
sequence is different on different platforms. Unix and alike use ASCII LF
(linefeed), while Windows uses ASCII sequence of CR LF (carriage return followed
by a linefeed). All of these sequences are treated equally as a single 
EOL symbol.

\subsubsection{Logical line}
A logical line can be a single physical line or multiple physical lines.
A logical line is terminated an EOL, which makes a physical line to be also 
a logical line.

The line continuation symbol, ``\textbackslash", joins two physical lines
by effectively ignoring the EOL symbol between them.
This makes it possible to have a logical line span across multiple 
physical lines. 
Note that only comments and EOL are allowed after the line continuation
symbol.

Simple statements can never cross the boundaries of a logical line.
For compound statements, the language syntax allows EOLs
to be in between statements inside a ``\{\}'' pair. 
Therefore compound statements may be consisted of multiple logical lines. 

Note that even for a compound statement, before the statement ends, 
EOL is still not allowed outside of the ``\{\}'' pair 

These rules are better clarified with the following example:

\begin{lstlisting}
# Line termination rules
if (x == 1) {
    y = 1
} else {
    y = 0
}
a = 5
\end{lstlisting}

Note EOLs can appear inside the ``\{\}'' pairs. 
But they cannot present outside of the ``\{\}'' pairs before the end of
the if statement. 
That is why the ``\{'' has to be in the same line with 
``\lstinline$if (x == 1)$''.
For the same reason, ``\lstinline$else$'' is in the same line with both 
the leading
``\}'' and the ending ``\{''. 

\subsubsection{Line joining}
The backslash character, ``\textbackslash'', can be used to join two 
physical lines.
This allows a logical line to span across multiple physical lines as follows:
\begin{lstlisting}
# Line joining by "\" character
x = a + b \
      + c \
      + d
\end{lstlisting}

\subsubsection{Comment}
The hash character, ``\#'', starts a single line comment runs to the end
of a physical line. There is currently no plan to support multi-line
comment symbols.

\subsubsection{Blank line and whitespace}
Both space and tab are whitespace. They are used to separate other 
language unit, but are otherwise ignored. Blank lines are ignored.

\subsection{Keyword and identifier}
Keywords are reserved and listed as follows\footnote{Keyword operators 
are shown in green color while normal keywords are shown in sepia. It is 
completely cosmetic with no impact on the language.}:
\begin{lstlisting}
print    if      elif  else   while   for      break  def   
continue return  null  class  import  package  try    raise
catch    finally read  not    and     or       xor 
\end{lstlisting}

An identifier is a character sequence starts with a letter or an underscore 
and followed by zero or more letters, numbers or underscore and it cannot
be any of the reserved keywords.

\subsubsection{Speical identifier}
A special identifier, ``_'' (without the quotes), always points
to the value of last unused expression.
\begin{lstlisting}
# Speical identifier
1 + 1
a = 2 + 2
print 3 + 3
\end{lstlisting}
At the end of the above code snippet, the value of ``_'' is 2. This is because 
both \lstinline$2 + 2$ and \lstinline$3 + 3$ are used by an assignment and
a print statement, respectively. User can directly assign a value to the special
identifier. In doing so, any value it held previously is overwritten.

\subsection{Number}
Numbers only come in two types, integer and floating point numbers.
Both numbers can be arbitrary large as long as the computer memory
can hold them. Only decimal numbers are supported. Floating point
number can be written in scientific notation form, e.g. 1.0e+20.

\subsection{Operator}
Certain keywords are also operators, such as \lstinline$and$, 
\lstinline$or$, \lstinline$xor$, \lstinline$not$. Other operators
includes mathematical operators and symbols for aiding formation
of language structs. They are show as follows:
\begin{lstlisting}
+ - * / % ** ( ) { } [ ] , : ;
\end{lstlisting}
Note that the unprintable EOL symbol is not shown in the above list,
though it is also an operator. 

\subsection{Token}
Tokens are recognized and returned by a lexer. A token can be a keyword,
an identifier, a number or an operator. The token definitions are listed
as follows:

\begin{verbatim}
IDENT       ::= (Letter| Underscore) 
                    (Letter | Digit | Underscore)*

STRING      ::= "Valid_Ascii_Sequence" | 'Valid_Ascii_Sequence'

KEYWORD     ::= print | if | else | for | while | continue 
              | break | def | class | and | or | xor | not
              
NUMBER      ::= [+ | -] Pos_Number

Pos_Number  ::= [0-9]+ 
              | [0-9]* . [0-9]+ [Expo] 
              | [0-9]+ . [0-9]* [Expo]

Expo        ::= e [+ | -] [0-9]+
\end{verbatim}

Note that only full capitalized words are tokens. Camel case words
are only here to aid reading and clarity. 
Operators and other characters are returned as tokens named by their lexme.

\subsection{Lexer}
A lexer reads the input source file and generate tokens based on 
the input character stream. The lexer return one token every time
it is called. A symbol table is used to reserve keywords and save
identifier information.

\pagebreak


\section{Syntax analysis}
A parser does the syntax analysis of source code and produce a parse tree
as its output. The parser calls the lexer for a stream of tokens and 
analyze them with a grammar.

\subsection{grammar}
The grammar is of the LL(1) class, which can be parsed by a hand written 
predictive recursive descendant parser. 
A grammar $G$ is LL(1) if and only if whenever $A \rightarrow \alpha\ |\ \beta$ 
are two distinct productions of $G$, the following conditions hold:
\begin{itemize}
\item FIRST($\alpha$) and FIRST($\beta$) are disjoint sets. This also indicates
at most one of $\alpha$ and $\beta$ can derive the empty string.
\item If $\varepsilon$ is in FIRST($\alpha$), then FIRST($\beta$) and 
FOLLOW($A$) are disjoint sets.
\end{itemize}

The BNF notations of the Emma language grammar are shown as follows:
\footnote{Note that left factoring is needed to properly parse 
following grammar as LL(1).}

\setlength{\grammarparsep}{10pt plus 1pt minus 1pt} % increase separation between rules
\setlength{\grammarindent}{12em} % increase separation between LHS/RHS 
\begin{grammar}

% We could let lexer to do more work on merge consecutive ';' as a single ';'

<file_input> ::= <statement>* "ENDMARK"

<prompt_input> ::= <statement>

<string_input> ::= [<simple_stmt>] "ENDMARK"

<statement> ::= "EOL" 
    \alt <simple_stmt> "EOL"
    \alt <compound_stmt> "EOL"


<simple_stmt> ::= <expr>
	\alt <assign_stmt>                    
	\alt <print_stmt>
	\alt <read_stmt>
	\alt <continue_stmt>
	\alt <break_stmt>
	\alt <return_stmt>
    \alt <del_stmt>
	\alt <package_stmt>
	\alt <import_stmt>
    \alt <raise_stmt>

<expr> ::= <r_expr>
	
<print_stmt> ::= `print' [`>' <primary>] [<expr_list>]

<read_stmt> ::= `read' [`<' <primary>] <expr_list>

<assign_stmt> ::= <expr> `=' <expr>\footnote{Additional rules are enforced for the assign_stmt by the interpreter so that only expressions that have valid left values can be used on the left side of the = symbol.}

<continue_stmt> ::= `continue'

<break_stmt> ::= `break'

<return_stmt> ::= `return' [<expr>]

<del_stmt>  :: `del' "IDENT" (`,' "IDENT")*

<package_stmt> ::= `package' "IDENT"

<import_stmt> ::= `import' "IDENT" (`.' "IDENT" | `*')*\footnote{
Additional rule is enforced by the interpreter to make sure the asterisk
can only appear once at the end of the line or not present at all.}

<raise_stmt> ::= `raise' [<primary>]

<compound_stmt> ::= <if_stmt>
	\alt <while_stmt>
	\alt <for_stmt>
	\alt <funcdef>
    \alt <classdef>
    \alt <try_stmt>

<if_stmt> ::= `if' <expr> <suite> [`elif' <expr> <suite>] [`else' <suite>]
              
<while_stmt> ::= `while' <expr> <suite>

<for_stmt> ::= `for' "IDENT" `=' <for_expr> <suite>

<for_expr> ::= <expr> `,' <expr> [`,' <expr>]

<funcdef> ::= `def' "IDENT" `(' [<parmlist>] `)' <suite>

<classdef> ::= `class' "IDENT" `(' ["IDENT"] `)' <suite>

<try_stmt> ::= `try' <suite> (<catch_stmt>)$^{+}$ [<finally_stmt>]

<catch_stmt> ::= `catch' `(' "IDENT" `,' "IDENT" `)' <suite>

<finally_stmt> ::= `finally' <suite>

<suite> ::= <simple_stmt> | <stmt_block>

<stmt_block> ::= `{' "EOL" <statement>* `}'
	\alt `{' `}'\footnote{The empty pair of curly brackets is a syntax sugar
to define an empty stmt_block. It is similar to python's pass statement.}


<expr_list> ::= <expr> (`,' <expr>)*

<r_expr> ::= <r_term> (<r_orop> <r_term>)*

<r_term> ::= <r_factor> (<r_andop> <r_factor>)*

<r_factor> ::= [`not'] <r_factor> | <l_expr>

<l_expr> ::= <a_expr> (<l_op> <a_expr>)*

<a_expr> ::= <a_term> (<addop> <a_term>)*

<a_term> ::= <factor> (<mulop> <factor>)*

<factor> ::= <unary_op> <factor> | <power>

<power> ::= <primary> [`**' <factor>]

<primary> ::= <atom> <trailer>*

<atom> ::= "IDENT" | <literal> | '(' expr ')'

<trailer> ::= `(' [<arglist>] `)' 
	\alt `[' <subscription> `]'
	\alt `.' "IDENT"

<literal> ::= "STRING" | "NUMBER" | "NULL"


<parmlist> ::= <oparm_list> [`,' `*' "IDENT"] [`,' `**' "IDENT"]
    \alt `*' "IDENT" [`,' `**' "IDENT"]
    \alt `**' "IDENT"

<oparm_list> ::= <oparm> (`,' <oparm>)*

<oparm> ::= <kvpair> | "IDENT"

<kvpair> ::= <expr> `=' <expr>\footnote{Similar to the assign_stmt, additional
rules are enforced to ensure the left expr has valid left value.}

<arglist> ::= <oarg> (`,' <oarg>)*

<oarg> ::= <kvpair> | <expr>

<subscription> ::= <singleidx>
    \alt <idxrange>
    \alt <idxlist>

<singleidx> ::= <expr>

<idxrange> ::= <expr> `:' [<expr>] [`:' [<expr>]]
    \alt `:' [<expr>] [`:' [<expr>]]

<idxlist> ::= <expr> (`,' <expr>)*

<r_orop> ::= `or' | `xor'

<r_andop> ::= `and'

<l_op> ::= `>' | `<' | `>=' | `<=' | `==' | `!='

<addop> ::= `+' | `-'

<mulop> ::= `*' | `/' | `\%'

<unary_op> ::= `+' | `-'

\end{grammar}

\subsection{Parser}

\pagebreak

\section{Intermediate representation}

\pagebreak

\section{Virtual machine}

\pagebreak

\section{Standard library}


\end{document}
